# COMPSCI 389: Introduction to Machine Learning
# Topic 5.1 Evaluation Re-Visited

At the bottom of this notebook, start with the "Notice" and "Answer" markdown cells collapsed, if possible.

Recall the following code from before. It does the following:
1. Import relevant libraries
2. Define evaluation metrics
3. Define the KNearestNeighbors model
4. Define the WeightedKNearestNeighbors model

In [1]:
import pandas as pd
from sklearn.neighbors import KDTree
from sklearn.base import BaseEstimator
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import numpy as np

def mean_squared_error(predictions, labels):
    return np.mean((predictions - labels) ** 2)

def root_mean_squared_error(predictions, labels):
    return np.sqrt(mean_squared_error(predictions, labels))

def mean_absolute_error(predictions, labels):
    return np.mean(np.abs(predictions - labels))

def r_squared(predictions, labels):
    ss_res = np.sum((labels - predictions) ** 2)        # ss_res is the "Sum of Squares of Residuals"
    ss_tot = np.sum((labels - np.mean(labels)) ** 2)    # ss_tot is the "Total Sum of Squares"
    return 1 - (ss_res / ss_tot)

ts = 0.05  # Fraction of the data held out as the test set (used by runTrial below)

class KNearestNeighbors(BaseEstimator):
    # Add a constructor that stores the value of k (a hyperparameter)
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        # Convert X and y to NumPy arrays if they are DataFrames
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Store the training data and labels
        self.X_data = X
        self.y_data = y
        
        # Create a KDTree for efficient nearest neighbor search
        self.tree = KDTree(X)

        return self

    def predict(self, X):
        # Convert X to a NumPy array if it's a DataFrame
        if isinstance(X, pd.DataFrame):
            X = X.values

        # Query the tree for the k nearest neighbors for all points in X
        dist, ind = self.tree.query(X, k=self.k)

        # Return the average label for the nearest neighbors of each query
        return np.mean(self.y_data[ind], axis=1)
    
class WeightedKNearestNeighbors(BaseEstimator):
    # Add a constructor that stores the value of k and sigma (hyperparameters)
    def __init__(self, k=3, sigma=1.0):
        self.k = k
        self.sigma = sigma

    def fit(self, X, y):
        # Convert X and y to NumPy arrays if they are DataFrames
        if isinstance(X, pd.DataFrame):
            X = X.values
        if isinstance(y, pd.Series):
            y = y.values

        # Store the training data and labels
        self.X_data = X
        self.y_data = y
        
        # Create a KDTree for efficient nearest neighbor search
        self.tree = KDTree(X)

        return self

    def gaussian_kernel(self, distance):
        # Gaussian kernel function
        return np.exp(- (distance ** 2) / (2 * self.sigma ** 2))

    def predict(self, X):
        # Convert X to a NumPy array if it's a DataFrame
        if isinstance(X, pd.DataFrame):
            X = X.values

        # We will append one prediction per query row, so the list starts empty
        predictions = []
        
        # Loop over rows in the query
        for x in X:
            # Query the tree for the k nearest neighbors
            dist, ind = self.tree.query([x], k=self.k)

            # Calculate weights using the Gaussian kernel
            weights = self.gaussian_kernel(dist[0])

            # If every neighbor is very far away, all weights can round to zero,
            # which would cause a division by zero in the weighted average below.
            # In that case, fall back to un-weighted k-NN (all weights equal to one).
            if np.sum(weights) == 0:
                weights = np.ones_like(weights)

            # Weighted average of the labels of the k nearest neighbors
            weighted_avg_label = np.average(self.y_data[ind[0]], weights=weights)
            predictions.append(weighted_avg_label)

        # Return the array of predictions we have created
        return np.array(predictions)

Next, let's define a function `runTrial` that:
1. Loads the GPA data set
2. Splits it into train and test sets
3. Trains different variants of nearest neighbors on the training data
4. Evaluates the models using the testing data
5. Reports the results

In [2]:
# Highlighting the best values in the DataFrame
def highlight_best(row, best_metrics):
    return ['font-weight: bold' if (col in best_metrics and row.name == best_metrics[col]) else '' for col in row.index]

def runTrial():
    # Load the data set
    df = pd.read_csv("https://people.cs.umass.edu/~pthomas/courses/COMPSCI_389_Spring2024/GPA.csv", delimiter=',') # Read GPA.csv, assuming numbers are separated by commas
    #df = pd.read_csv("data/GPA.csv", delimiter=',')

    # Separate the features (all columns but the last) from the labels (the last column)
    X = df.iloc[:, :-1]
    y = df.iloc[:, -1]

    # Split the data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=ts, shuffle=True)

    # Model parameters to test
    parameters = [
        {"k": 1, "sigma": None},    # Standard NN
        {"k": 100, "sigma": None},  # Standard k-NN
        {"k": 110, "sigma": 90}     # Weighted k-NN
    ]

    # List to store one dictionary of results per model
    results = []

    # Training and evaluating each model
    for param in parameters:
        if param["sigma"] is None:
            model = KNearestNeighbors(k=param["k"])
        else:
            model = WeightedKNearestNeighbors(k=param["k"], sigma=param["sigma"])
        model.fit(X_train, y_train)
        predictions = model.predict(X_test)

        mse = mean_squared_error(predictions, y_test)
        rmse = root_mean_squared_error(predictions, y_test)
        mae = mean_absolute_error(predictions, y_test)
        r2 = r_squared(predictions, y_test)

        results.append({"Model": f"k-NN k={param['k']} sigma={param['sigma']}", 
                        "MSE": mse, "RMSE": rmse, "MAE": mae, "R^2": r2})

    # Creating DataFrame for results
    results_df = pd.DataFrame(results)

    # Finding the best (minimum or maximum) values for each metric
    best_metrics = {
        "MSE": results_df['MSE'].idxmin(),
        "RMSE": results_df['RMSE'].idxmin(),
        "MAE": results_df['MAE'].idxmin(),
        "R^2": results_df['R^2'].idxmax()
    }

    # Apply the highlighting
    styled_results = results_df.style.apply(highlight_best, best_metrics=best_metrics, axis=1)
    display(styled_results)

Run the next cell several times:

In [3]:
runTrial()

| | Model | MSE | RMSE | MAE | R^2 |
|---|---|---|---|---|---|
| 0 | k-NN k=1 sigma=None | 1.043778 | 1.021655 | 0.78648 | -0.585888 |
| 1 | k-NN k=100 sigma=None | 0.544447 | 0.737867 | 0.573276 | 0.172782 |
| 2 | k-NN k=110 sigma=90 | 0.544862 | 0.738148 | 0.573676 | 0.172152 |
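
The negative R^2 for the k=1 model in the results above is not a bug. With the definition of `r_squared` used in this notebook, the score drops below zero whenever a model's sum of squared residuals exceeds that of always predicting the mean label. A minimal sketch, using made-up numbers rather than the GPA data (the arrays `labels`, `mean_baseline`, and `poor_model` are illustrative assumptions):

```python
import numpy as np

def r_squared(predictions, labels):
    # Same definition as the r_squared function earlier in this notebook
    ss_res = np.sum((labels - predictions) ** 2)
    ss_tot = np.sum((labels - np.mean(labels)) ** 2)
    return 1 - (ss_res / ss_tot)

# Made-up labels and predictions, for illustration only
labels = np.array([1.0, 2.0, 3.0, 4.0])
mean_baseline = np.full_like(labels, labels.mean())  # always predict the mean label
poor_model = np.array([4.0, 3.0, 2.0, 1.0])          # predictions in reversed order

r2_baseline = r_squared(mean_baseline, labels)  # ss_res equals ss_tot, so R^2 is 0
r2_poor = r_squared(poor_model, labels)         # ss_res is 4x ss_tot, so R^2 is -3
print(r2_baseline, r2_poor)
```

Read this way, an R^2 below zero on the test set says that the model predicted worse than a constant guess at the mean test label would have.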


**Notice**: We cannot trust these evaluations to tell us which model is better! The model with the best metrics often changes when we re-run the code.

This is partly because we used a very small test set (`ts = 0.05`, i.e., 5% of the data).
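
Re-running the cell by hand can be automated by repeating the random split many times and counting how often each model wins. The sketch below is only an illustration under stated assumptions: it uses a small synthetic data set instead of GPA.csv, a brute-force k-NN written in NumPy instead of the classes above, and hypothetical values of k (20 and 30). The helper names `knn_predict` and `one_trial` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a real data set (an assumption of this sketch)
n = 400
X = rng.normal(size=(n, 2))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n)

def knn_predict(X_train, y_train, X_test, k):
    # Brute-force k-NN regression: average the labels of the k closest training points
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

def one_trial(test_fraction, ks=(20, 30)):
    # One random train/test split, returning the test MSE for each value of k
    perm = rng.permutation(n)
    n_test = int(round(test_fraction * n))
    test_idx, train_idx = perm[:n_test], perm[n_test:]
    mses = []
    for k in ks:
        predictions = knn_predict(X[train_idx], y[train_idx], X[test_idx], k)
        mses.append(np.mean((predictions - y[test_idx]) ** 2))
    return mses

# Repeat the experiment and record which model had the lower test MSE each time
n_trials = 200
winners = [int(np.argmin(one_trial(test_fraction=0.05))) for _ in range(n_trials)]
win_counts = np.bincount(winners, minlength=2)
print("Wins per model over", n_trials, "random splits:", win_counts)
```

If one model won nearly every split, a single trial would be a trustworthy comparison; when the wins are shared between the models, the result of any single split says little about which model is better.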

**Question**: Can this happen when you use a larger portion of the data set (say, 50%)?

**Answer**: 

Yes! This is particularly likely when the models perform very similarly or when the total amount of data is small.
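
One way to see why is to simulate it. The sketch below is a simplified illustration, not a description of the GPA experiment: it assumes each model's test errors are independent draws from a zero-mean normal distribution, with model B's true MSE larger than model A's by `true_gap`. The function name `flip_rate` and all numbers are hypothetical. In a real comparison both models are scored on the same test points, so their errors are correlated and the exact rates would differ.

```python
import numpy as np

rng = np.random.default_rng(1)

def flip_rate(n_test, true_gap, n_repeats=1000):
    """Fraction of simulated test sets on which the truly worse model (B)
    ends up with the lower test MSE."""
    # Model A: errors with variance 1 (true MSE = 1)
    err_a = rng.normal(scale=1.0, size=(n_repeats, n_test))
    # Model B: errors with variance 1 + true_gap (true MSE = 1 + true_gap)
    err_b = rng.normal(scale=np.sqrt(1.0 + true_gap), size=(n_repeats, n_test))
    mse_a = np.mean(err_a ** 2, axis=1)
    mse_b = np.mean(err_b ** 2, axis=1)
    return np.mean(mse_b < mse_a)

# A small true gap with test sets of increasing size
for n_test in (50, 500, 2000):
    print(n_test, flip_rate(n_test, true_gap=0.01))
```

Under these assumptions, a larger test set lowers the chance of picking the wrong model, but when the true gap is small the chance can stay far from zero even with thousands of test points.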

To address this, we need to delve a little into probability and statistics.